STA4173: Biostatistics
Spring 2025
In the last lecture, we focused on describing data.
Today, we will focus on drawing conclusions about two population means using data.
Point Estimate
The single value of a statistic that estimates the value of a parameter.
Examples of point estimates:
It is necessary to know how good our estimation is, or to quantify our uncertainty.
Confidence Interval
A range of plausible values for the parameter based on values observed in the sample.
\text{estimate} \pm \text{margin of error}
Level of Confidence
The probability that the interval will capture the true parameter value in repeated samples, i.e., the success rate of the method.
(1-\alpha)100\% confidence interval for \mu_1-\mu_2
(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2} \sqrt{\frac{s_1^2 }{n_1} + \frac{s_2^2}{n_2}} where t_{\alpha/2} has \text{min}(n_1-1, n_2-1) degrees of freedom.
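The formula above can be sketched directly from summary statistics. This is a minimal sketch, not the lecture's own code: the helper name `two_mean_ci` and the input values in the usage line are made up for illustration. It uses the conservative degrees of freedom \text{min}(n_1-1, n_2-1) stated above; R's `t.test()` instead uses the Welch-Satterthwaite approximation, so its interval can differ slightly.

```r
# Hypothetical helper (not from the lecture): CI for mu_1 - mu_2 from summary
# statistics, using the conservative df = min(n1 - 1, n2 - 1).
two_mean_ci <- function(xbar1, xbar2, s1, s2, n1, n2, conf_level = 0.95) {
  alpha    <- 1 - conf_level
  deg_free <- min(n1 - 1, n2 - 1)
  se       <- sqrt(s1^2 / n1 + s2^2 / n2)      # standard error of the difference
  margin   <- qt(1 - alpha / 2, deg_free) * se # t_{alpha/2} * SE
  estimate <- xbar1 - xbar2                    # point estimate
  c(lower = estimate - margin, upper = estimate + margin)
}

# Illustrative, made-up inputs (not course data)
two_mean_ci(xbar1 = 10, xbar2 = 8, s1 = 2, s2 = 3, n1 = 25, n2 = 30)
```

The interval is always centered at the point estimate, and raising `conf_level` widens it.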
Let’s find the 95% confidence interval for the difference in average weight (body_mass_g) between male and female (sex) penguins.
Remember the R syntax:
What is the continuous variable?
What is the grouping variable?
What is the dataset name?
What is the confidence level?
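Answering the four questions above fills in a call to `t.test()`. The sketch below assumes the penguin data are the `penguins` data frame from the `palmerpenguins` package (the source does not name the package); the call itself is inferred from the `data: body_mass_g by sex` line in the printed output.

```r
# General form:
#   t.test(continuous ~ grouping, data = dataset, conf.level = level)
# Assumption: the data come from the palmerpenguins package.
if (requireNamespace("palmerpenguins", quietly = TRUE)) {
  penguins <- palmerpenguins::penguins
  result <- t.test(body_mass_g ~ sex, data = penguins, conf.level = 0.95)
  print(result)
}
```

Changing `conf.level` to 0.99 leaves the estimates and test statistic unchanged and only widens the interval.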
Welch Two Sample t-test
data: body_mass_g by sex
t = -8.5545, df = 323.9, p-value = 4.794e-16
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-840.5783 -526.2453
sample estimates:
mean in group female mean in group male
3862.273 4545.685
Welch Two Sample t-test
data: body_mass_g by sex
t = -8.5545, df = 323.9, p-value = 4.794e-16
alternative hypothesis: true difference in means between group female and group male is not equal to 0
99 percent confidence interval:
-890.4112 -476.4124
sample estimates:
mean in group female mean in group male
3862.273 4545.685
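Both intervals can be approximately reconstructed from the numbers printed above, which shows that only the t quantile changes between the 95% and 99% levels. This is a sketch using the rounded values in the output, so the results agree with R's only up to rounding; the variable and function names are illustrative.

```r
# Values copied from the printed output above
mean_female <- 3862.273
mean_male   <- 4545.685
t_stat      <- -8.5545
deg_free    <- 323.9

estimate <- mean_female - mean_male  # point estimate of the difference
se       <- estimate / t_stat        # t = estimate / SE when the null difference is 0

# Hypothetical helper: interval at a given confidence level
interval_at <- function(conf_level) {
  margin <- qt(1 - (1 - conf_level) / 2, deg_free) * se
  c(lower = estimate - margin, upper = estimate + margin)
}

interval_at(0.95)
interval_at(0.99)
```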
\begin{align*} P[\text{T, T, T, T, T}] &= 0.5 \times 0.5 \times 0.5 \times 0.5 \times 0.5 \\ &= 0.03125 \end{align*}
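The arithmetic above can be checked in R. This small sketch assumes a fair coin and independent flips, as the factors of 0.5 imply; the variable names are illustrative.

```r
p_tails <- 0.5                 # probability of tails on one flip of a fair coin
p_five_tails <- p_tails^5      # five independent flips, all tails
p_five_tails                   # 0.03125

# Same value from the binomial distribution: 5 tails in 5 flips
dbinom(5, size = 5, prob = p_tails)
```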
Hypothesis Testing
A procedure, based on sample evidence and probability, used to test statements regarding a characteristic of one or more populations.
Steps in hypothesis testing
Make a statement regarding the nature of the population.
Collect evidence (sample data) to test the statement.
Analyze the data to assess the plausibility of the statement.
Note: if we have population parameters available, we do not need to perform a hypothesis test.
Hypothesis
A statement regarding a characteristic of one or more populations.
Null hypothesis, H_0
A statement to be tested.
Alternative hypothesis, H_1
A statement that we are trying to find evidence to support.
Hypothesis Test for Two Independent Means
Hypotheses
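Test Statistic t_0 = \frac{(\bar{x}_1 - \bar{x}_2) - \mu_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}, where \mu_0 is the hypothesized difference \mu_1 - \mu_2 under H_0.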
p-Value
Rejection Region
Conclusion/Interpretation
[Reject or fail to reject] H_0.
There [is or is not] sufficient evidence to suggest [alternative hypothesis in words].
Consider the penguin data. Is there a significant difference in weight (body_mass_g) between male and female penguins? Test at the \alpha=0.05 level.
Remember the R syntax:
What is the continuous variable?
What is the grouping variable?
What is the dataset name?
What is the hypothesized difference?
What is the alternative?
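Answering the questions above fills in a call to `t.test()`. The sketch below assumes the penguin data are the `penguins` data frame from the `palmerpenguins` package (the source does not name the package); the hypothesized difference of 0 and the two-sided alternative are read from the problem statement.

```r
# General form:
#   t.test(continuous ~ grouping, data = dataset,
#          mu = hypothesized_difference, alternative = "two.sided")
# Assumption: the data come from the palmerpenguins package.
if (requireNamespace("palmerpenguins", quietly = TRUE)) {
  penguins <- palmerpenguins::penguins
  test_result <- t.test(body_mass_g ~ sex, data = penguins,
                        mu = 0, alternative = "two.sided")
  print(test_result)
}
```

The `alternative` argument also accepts `"less"` and `"greater"` for one-sided tests.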
Welch Two Sample t-test
data: body_mass_g by sex
t = -8.5545, df = 323.9, p-value = 4.794e-16
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-840.5783 -526.2453
sample estimates:
mean in group female mean in group male
3862.273 4545.685
Hypotheses
Test Statistic and p-Value
Rejection Region
Conclusion/Interpretation
Reject H_0.
There is sufficient evidence to suggest that male and female penguins have different weights.
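The decision above can be traced from the printed test statistic and degrees of freedom. This is a sketch using the rounded values from the output, with hypotheses H_0: \mu_F - \mu_M = 0 versus H_1: \mu_F - \mu_M \ne 0 read from the output's alternative-hypothesis line; the variable names are illustrative.

```r
# Values copied from the printed output
t0       <- -8.5545
deg_free <- 323.9
alpha    <- 0.05

# Two-sided p-value: probability of a statistic at least as extreme as t0
p_value <- 2 * pt(-abs(t0), deg_free)

# Rejection region: reject H0 if |t0| exceeds the critical value
critical_value <- qt(1 - alpha / 2, deg_free)
reject_h0 <- abs(t0) > critical_value   # same decision as p_value < alpha

p_value
critical_value
reject_h0
```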
Independent data
An individual selected for one sample does not dictate which individual is to be in a second sample.
In the data, there is not a way to link the individuals in the sample.
Dependent data
An individual selected to be in one sample is used to determine the individual in the second sample.
In the data, there is a way to link the individuals in the sample.
We are now interested in comparing two dependent groups.
We assume that the two groups come from the same population and examine the difference within each pair i,
d_i = y_{i, 1} - y_{i, 2}
(1-\alpha)100\% confidence interval for \mu_d
\bar{d} \pm t_{\alpha/2} \frac{s_d}{\sqrt{n}} where t_{\alpha/2} has n-1 degrees of freedom and n is the number of pairs.
R syntax:
Construct the 95% confidence interval for the average difference between the two garages.
Remember the R syntax:
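A sketch of the paired call, inferred from the `data: garage$g1 and garage$g2` line in the printed output. It assumes a data frame named `garage` with numeric columns `g1` and `g2` is already loaded; the data themselves are not reproduced here, so the call is guarded.

```r
# General form:
#   t.test(dataset$group1, dataset$group2, paired = TRUE, conf.level = level)
# Assumption: `garage` is already loaded with one row per pair of estimates.
if (exists("garage")) {
  paired_result <- t.test(garage$g1, garage$g2,
                          paired = TRUE, conf.level = 0.95)
  print(paired_result)
}
```

With `paired = TRUE`, R computes the difference within each row and builds the interval for the mean difference.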
Paired t-test
data: garage$g1 and garage$g2
t = 6.0234, df = 14, p-value = 3.126e-05
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
0.3949412 0.8317254
sample estimates:
mean difference
0.6133333
The 95% CI for \mu_d, where d = x_{\text{I}} - x_{\text{II}}, is (0.39, 0.83).
From the problem statement:
Can we say that estimates from garage I are higher than those from garage II?
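One informal way to approach this question is to check whether the interval lies entirely above 0. The sketch below rebuilds the 95% interval from the rounded values in the printed output (mean difference, t statistic, degrees of freedom), so it matches R's result only up to rounding; the variable names are illustrative.

```r
# Values copied from the printed paired t-test output
mean_diff <- 0.6133333
t_stat    <- 6.0234
deg_free  <- 14            # n - 1, so n = 15 pairs

se     <- mean_diff / t_stat            # s_d / sqrt(n), since the null difference is 0
margin <- qt(0.975, deg_free) * se      # t_{alpha/2} * SE for a 95% interval
ci     <- c(lower = mean_diff - margin, upper = mean_diff + margin)

ci
ci[["lower"]] > 0   # TRUE when every plausible value of mu_d is positive
```

A two-sided interval excluding 0 is suggestive, but a one-sided claim such as "higher" is assessed formally with a one-sided hypothesis test.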
Hypothesis Test for Two Dependent Means
Hypotheses
Test Statistic t_0 = \frac{\bar{d}-\mu_0}{\frac{s_d}{\sqrt{n}}}
p-Value
Rejection Region
Conclusion/Interpretation
[Reject or fail to reject] H_0.
There [is or is not] sufficient evidence to suggest [alternative hypothesis in words].
t.test() function.
Let’s now formally determine if garage I’s estimates are higher than garage II’s. Test at the \alpha=0.05 level.
Recall the data,
R syntax:
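A sketch of the one-sided paired call, inferred from the printed output (`data: garage$g1 and garage$g2`, alternative "greater than 0"). It assumes a data frame named `garage` with numeric columns `g1` and `g2` is already loaded, so the call is guarded.

```r
# General form:
#   t.test(dataset$group1, dataset$group2, paired = TRUE,
#          mu = hypothesized_difference, alternative = "greater")
# Assumption: `garage` is already loaded with one row per pair of estimates.
if (exists("garage")) {
  greater_result <- t.test(garage$g1, garage$g2, paired = TRUE,
                           mu = 0, alternative = "greater")
  print(greater_result)
}
```

The order of the two vectors matters: differences are computed as the first minus the second, so `alternative = "greater"` tests whether garage I's estimates tend to be higher.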
Paired t-test
data: garage$g1 and garage$g2
t = 6.0234, df = 14, p-value = 1.563e-05
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
0.4339886 Inf
sample estimates:
mean difference
0.6133333
Hypotheses
Test Statistic and p-Value
Rejection Region
Conclusion/Interpretation
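The four steps can be traced from the printed output. This is a sketch using the rounded test statistic and degrees of freedom, with hypotheses H_0: \mu_d \le 0 versus H_1: \mu_d > 0 read from the output's alternative-hypothesis line; the variable names are illustrative.

```r
# Values copied from the printed paired t-test output
t0       <- 6.0234
deg_free <- 14
alpha    <- 0.05

# One-sided p-value for H1: mu_d > 0 (upper tail)
p_value <- pt(t0, deg_free, lower.tail = FALSE)

# Rejection region: reject H0 if t0 exceeds the upper-tail critical value
critical_value <- qt(1 - alpha, deg_free)
reject_h0 <- t0 > critical_value   # same decision as p_value < alpha

p_value
critical_value
reject_h0
```

Under these numbers the p-value is far below 0.05, so the decision would be to reject H_0: there is sufficient evidence to suggest that garage I's estimates are higher than garage II's.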
Today we reviewed statistical inference.
Get to know you quiz - complete with RStudio - due today.
Next meeting: how to conceptualize research questions.